Predicting Employee Turnover: A Data-Driven HR Strategy at Salifort Motors¶
Description and Deliverables¶
This capstone project explores the factors influencing employee attrition at Salifort Motors, aiming to support the HR department in improving retention and overall employee satisfaction. By analyzing a dataset containing employee satisfaction scores, evaluation history, departmental assignments, and more, I built predictive models to identify employees most likely to leave the company.
Designed to simulate a real-world data science workflow, this project includes:
A fully documented Jupyter Notebook detailing the data cleaning, exploratory data analysis (EDA), modeling, and evaluation process.
A one-page executive summary for external stakeholders, highlighting key findings and actionable recommendations.
I approached the problem using machine learning models, specifically Logistic Regression, Decision Tree, and Random Forest, to balance accuracy with interpretability. Model performance was evaluated using accuracy, precision, recall, F1-score, and ROC-AUC.
Beyond technical outcomes, the project addresses ethical considerations in employee attrition modeling and includes visualizations to clearly communicate key trends. Troubleshooting steps, consulted resources, and methodological justifications are thoroughly documented throughout the notebook.
📌 Pace: Plan Stage¶
- Understand the business problem
- Define the objective: predict employee attrition to support HR decisions
- Identify the target variable: `left`
Packages¶
In this project, I used several Python packages to handle data processing, visualization, modeling, and evaluation tasks:
- pandas and numpy for data manipulation and numerical operations
- matplotlib and seaborn for data visualization
- scikit-learn (`sklearn`) for building machine learning models (Logistic Regression, Decision Tree, and Random Forest), hyperparameter tuning, and evaluating model performance
- pickle for saving trained models for future use
These packages provide the core functionality required to execute a typical end-to-end data science workflow, from data wrangling to insight generation.
# Data Manipulation
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None) # Show all columns when displaying a DataFrame
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Modeling - Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
# Modeling - Utilities
from sklearn.model_selection import train_test_split, GridSearchCV
# Evaluation Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings("ignore", message="invalid value encountered in cast")
# Model Saving
import pickle
🧭 Step 1: Data Loading and Initial Exploration¶
df0 = pd.read_csv('HR_capstone_dataset.csv')
df0.head()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
df0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   satisfaction_level     14999 non-null  float64
 1   last_evaluation        14999 non-null  float64
 2   number_project         14999 non-null  int64
 3   average_montly_hours   14999 non-null  int64
 4   time_spend_company     14999 non-null  int64
 5   Work_accident          14999 non-null  int64
 6   left                   14999 non-null  int64
 7   promotion_last_5years  14999 non-null  int64
 8   Department             14999 non-null  object
 9   salary                 14999 non-null  object
dtypes: float64(2), int64(6), object(2)
memory usage: 1.1+ MB
df0.describe()
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
📊 Step 2: Exploratory Data Analysis (EDA)¶
🔍 2.1: Column Overview¶
- Display column names to check for typos or inconsistencies
- Identify any columns that may need renaming
print("Original column names:")
print(df0.columns.tolist())
Original column names: ['satisfaction_level', 'last_evaluation', 'number_project', 'average_montly_hours', 'time_spend_company', 'Work_accident', 'left', 'promotion_last_5years', 'Department', 'salary']
df0 = df0.rename(columns={
'average_montly_hours': 'average_monthly_hours',
'time_spend_company': 'tenure',
'Work_accident': 'work_accident',
'Department': 'department'
})
# Check for any misspelled column names (e.g., 'montly' instead of 'monthly')
for col in df0.columns:
if "montly" in col:
print(f"Possible typo found: {col}")
print("Renamed columns:")
print(df0.columns.tolist())
Renamed columns: ['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'tenure', 'work_accident', 'left', 'promotion_last_5years', 'department', 'salary']
💡 Insight¶
Column names were cleaned and standardized for consistency and clarity. Typos like average_montly_hours were corrected, and capitalized names (e.g., Work_accident, Department) were converted to lowercase. These changes help avoid errors and make the dataset easier to work with throughout the analysis.
2.2: Missing Values Check¶
- Identify if there are any missing values in the dataset
- Confirm data completeness before proceeding
if df0.isna().sum().sum() > 0:
print("There are missing values in the dataset:")
print(df0.isna().sum()[df0.isna().sum() > 0])
else:
print("✅ There are no missing values in the dataset.")
✅ There are no missing values in the dataset.
💡 Insight¶
No missing values were found in the dataset. This ensures we can proceed with analysis and modeling without the need for imputation or row deletion.
♻️ 2.3: Duplicate Detection and Handling¶
- Count and inspect duplicate rows
- Drop them if they're exact matches and not analytically useful
# Count duplicates
num_duplicates = df0.duplicated().sum()
# Calculate percentage
percent_duplicates = (num_duplicates / len(df0)) * 100
# Print results
print(f"⚠️ There are {num_duplicates} duplicate rows in the dataset.")
print(f"That represents {percent_duplicates:.2f}% of the total data.")
df0[df0.duplicated()].head()
⚠️ There are 3008 duplicate rows in the dataset. That represents 20.05% of the total data.
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.46 | 0.57 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.41 | 0.46 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.37 | 0.51 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.41 | 0.52 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
# Inspecting some rows that contain duplicates
df0[df0.duplicated()].head(n = 10)
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 396 | 0.46 | 0.57 | 2 | 139 | 3 | 0 | 1 | 0 | sales | low |
| 866 | 0.41 | 0.46 | 2 | 128 | 3 | 0 | 1 | 0 | accounting | low |
| 1317 | 0.37 | 0.51 | 2 | 127 | 3 | 0 | 1 | 0 | sales | medium |
| 1368 | 0.41 | 0.52 | 2 | 132 | 3 | 0 | 1 | 0 | RandD | low |
| 1461 | 0.42 | 0.53 | 2 | 142 | 3 | 0 | 1 | 0 | sales | low |
| 1516 | 0.40 | 0.50 | 2 | 127 | 3 | 0 | 1 | 0 | IT | low |
| 1616 | 0.37 | 0.46 | 2 | 156 | 3 | 0 | 1 | 0 | sales | low |
| 1696 | 0.39 | 0.56 | 2 | 160 | 3 | 0 | 1 | 0 | sales | low |
| 1833 | 0.10 | 0.85 | 6 | 266 | 4 | 0 | 1 | 0 | sales | low |
| 12000 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
# Removing duplicates.
df1 = df0.drop_duplicates(keep = 'first')
df1.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
💡 Insight¶
Out of 14,999 records, 3,008 were identified as exact duplicates (20.05%) and were removed. This step ensured a cleaner dataset of 11,991 unique employee records, avoiding potential bias in model training.
📈 2.4: Feature Distribution Overview¶
Before checking for outliers, it’s important to understand how the numeric features are distributed across the dataset. This helps identify skewness, common value ranges, and potential anomalies.
# Plotting histograms for all numeric features
df1.hist(figsize=(14, 10), bins=30, edgecolor='black')
plt.suptitle('Distribution of Numeric Features', fontsize=16)
plt.tight_layout()
plt.show()
💡 Insight¶
Most numeric features are either categorical-like (e.g., number_project, tenure) or binary (e.g., work_accident, promotion_last_5years). A few features like satisfaction_level and average_monthly_hours show some variation, with noticeable clustering and possible multimodal patterns. This distribution check helps establish expectations before formal outlier detection and modeling.
🚩 2.5: Outlier Detection¶
plt.figure(figsize = (6, 5))
plt.title('Boxplot to detect outliers for tenure', fontsize = 12)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
sns.boxplot(x = df1['tenure'])
plt.show()
💡 Insight¶
The boxplot above reveals several outliers in the tenure column. While most employees stay with the company for 3 to 5 years, there are a few who have stayed for 6 to 10 years, making them statistically unusual based on the interquartile range (IQR). These outliers may indicate long-tenured employees or special retention cases worth further exploration.
🔢 2.6: Quantifying tenure Outliers¶
Determining the number of rows containing outliers
# Computing the 25th percentile value in 'tenure'
percentile25 = df1['tenure'].quantile(0.25)
# Computing the 75th percentile value in 'tenure'
percentile75 = df1['tenure'].quantile(0.75)
# Computing the interquartile range in 'tenure'
iqr = percentile75 - percentile25
# Defining the upper and lower limits for non-outlier values in 'tenure'
upper_limit = percentile75 + 1.5 * iqr
lower_limit = percentile25 - 1.5 * iqr
print('Upper limit:', upper_limit)
print('Lower limit:', lower_limit)
# Identifying subset of the data, that contains outliers in 'tenure'
outliers = df1[(df1['tenure'] > upper_limit) | (df1['tenure'] < lower_limit)]
# Count how many rows in the data contain outliers in `tenure`
print("Number of rows in the data containing outliers in `tenure`:", len(outliers))
Upper limit: 5.5
Lower limit: 1.5
Number of rows in the data containing outliers in `tenure`: 824
💡 Insight¶
Based on the interquartile range (IQR) method, any employee with a tenure below 1.5 years or above 5.5 years is considered an outlier. A total of 824 outliers were identified in the tenure column, indicating a noteworthy number of employees who either left very early or stayed much longer than the typical 2–5 year range.
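The IQR computation above can be wrapped in a small reusable helper so the same rule applies to any numeric column. A minimal sketch on a toy series (the helper name `iqr_bounds` and the toy data are illustrative, not part of the notebook):

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) non-outlier limits using the IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series standing in for df1['tenure'] (illustration only)
s = pd.Series([2, 3, 3, 3, 4, 4, 5, 10])
lower, upper = iqr_bounds(s)
n_outliers = ((s < lower) | (s > upper)).sum()
```

Applied to the real column, `iqr_bounds(df1['tenure'])` would reproduce the 1.5/5.5 limits computed above.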
pAce: Analyze Stage¶
- Perform EDA to analyze relationships between variables.
🔍 2.7: Employee Attrition Summary¶
How many employees have left the company, and what percentage does this figure represent?
# Getting the number of people who left vs stayed.
counts = df1['left'].value_counts()
percentages = df1['left'].value_counts(normalize=True) * 100
# Getting the percentage of people who left vs stayed.
print(f'✅ Total employees: {df1.shape[0]}')
print(f'🟢 Stayed: {counts[0]} ({percentages[0]:.2f}%)')
print(f'🔴 Left: {counts[1]} ({percentages[1]:.2f}%)')
✅ Total employees: 11991
🟢 Stayed: 10000 (83.40%)
🔴 Left: 1991 (16.60%)
💡 Insight¶
Among the 11,991 employees in the cleaned dataset, 1,991 employees (16.60%) left the company, while 10,000 (83.40%) remained. This class imbalance is important to keep in mind during modeling, as it can impact the performance of classification algorithms.
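Two common mitigations for this kind of imbalance are a stratified train/test split and class reweighting. A minimal sketch on synthetic data (the array shapes, the ~17% positive rate, and the parameter choices below are illustrative assumptions; the notebook's actual split happens later):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features/target (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (rng.random(200) < 0.17).astype(int)  # roughly 17% positives, like `left`

# stratify=y preserves the class ratio in both splits;
# class_weight='balanced' upweights the minority class during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=42)
clf = LogisticRegression(class_weight='balanced').fit(X_tr, y_tr)
```

Stratification matters most when, as here, the positive class is rare enough that a random split could leave the test set with too few leavers to evaluate recall reliably.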
📊 2.8: Monthly Hours vs. Number of Projects (Boxplot + Histogram)¶
Plots to visualize relationships between variables in the data.
fig, ax = plt.subplots(1, 2, figsize = (22, 8))
sns.boxplot(data = df1, x = 'average_monthly_hours', y = 'number_project', hue = 'left', orient = 'h', ax = ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Monthly hours by number of projects', fontsize = '14')
sns.histplot(data = df1, x = 'number_project', hue = 'left', multiple = 'dodge', shrink = 2, ax = ax[1])
ax[1].set_title('Number of projects histogram', fontsize = '14')
plt.show()
💡 Insight¶
The boxplot shows that employees with very low (2) or very high (6–7) project loads were more likely to leave the company. Those with moderate workloads (3–5 projects) tended to stay, suggesting that both underutilization and overwork may contribute to attrition.
Additionally, those who left with many projects also logged significantly higher average monthly hours, indicating a potential risk of burnout.
🔎 Checking employee attrition for high project load (7 projects)¶
We'll check how many employees with exactly 7 projects stayed versus left the company.
df1[df1['number_project']==7]['left'].value_counts()
left
1    145
Name: count, dtype: int64
💡 Insight¶
Every employee who had 7 projects left the company (145 cases in total).
This strongly suggests that an excessive project load may be unsustainable and is a clear signal of potential employee burnout or disengagement.
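This check generalizes to an attrition rate per project count: averaging the 0/1 `left` flag within each group yields the proportion who left. A sketch on a toy frame (the toy values are illustrative, not the real data):

```python
import pandas as pd

# Toy frame mimicking the two relevant columns (illustration only)
toy = pd.DataFrame({
    'number_project': [2, 2, 3, 4, 7, 7],
    'left':           [1, 0, 0, 0, 1, 1],
})
# Mean of the 0/1 `left` column = attrition rate per project count
rate_by_projects = toy.groupby('number_project')['left'].mean()
# On the real data: df1.groupby('number_project')['left'].mean()
```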
Next, examine average monthly hours versus satisfaction levels.
plt.figure(figsize=(14, 7))
sns.scatterplot(data=df1, x='average_monthly_hours', y='satisfaction_level', hue='left', alpha=0.4)
plt.axvline(x=166.67, color='#ff6361', label='166.67 hrs./mo.', ls='--')
plt.legend(labels=['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by satisfaction level', fontsize='14');
💡 Insight¶
Monthly Hours vs. Satisfaction Level

The scatterplot comparing average monthly hours to satisfaction level, colored by whether employees left or stayed, reveals two distinct patterns of attrition:
Overworked & Dissatisfied Employees
A dense cluster of employees with low satisfaction and very high monthly hours (above 250) chose to leave the company.
This suggests classic burnout: employees working long hours while feeling dissatisfied are highly prone to attrition.
High-Performing Leavers
Surprisingly, many employees with high satisfaction scores and high work hours also left.
This could point to high-performing individuals who felt overburdened or undervalued despite good performance, possibly leading to voluntary exits.
A vertical reference line at 166.67 hours/month (the estimated full-time workload) helps to distinguish typical from excessive work patterns:
Most employees who stayed worked between 150 and 250 hours/month, reporting moderate to high satisfaction.
Leavers cluster both below and above this band, highlighting multiple paths to attrition: burnout, underutilization, or unmet expectations.
Takeaway: Both overwork and lack of recognition are potential drivers of turnover at Salifort Motors.
For the next visualization, it might be interesting to visualize satisfaction levels by tenure.
fig, ax = plt.subplots(1, 2, figsize = (22, 8))
# Boxplot showing distributions of `satisfaction_level` by tenure, comparing employees who stayed versus those who left
sns.boxplot(data = df1, x = 'satisfaction_level', y = 'tenure', hue = 'left', orient = 'h', ax = ax[0])
ax[0].invert_yaxis()
ax[0].set_title('Satisfaction by tenure', fontsize = '14')
# Histogram showing the distribution of `tenure`, comparing employees who stayed versus those who left
sns.histplot(data = df1, x = 'tenure', hue = 'left', multiple = 'dodge', shrink = 5, ax = ax[1])
ax[1].set_title('Tenure histogram', fontsize = '14')
plt.show()
💡 Insight¶
Satisfaction Level by Tenure
📊 Boxplot (Satisfaction Level vs. Tenure)¶
- Employees with low satisfaction and shorter tenure (2–4 years) show a higher likelihood of leaving the company.
- A noticeable cluster of employees who left is concentrated around the 4-year tenure mark, suggesting this is a critical point of attrition.
- Employees with longer tenure (8–10 years) tend to stay, showing relatively stable satisfaction levels.
📈 Histogram (Tenure Distribution by Status)¶
- The majority of employees have a tenure of 3–5 years, with attrition peaking in this range.
- Very few employees remain beyond 7+ years, and among those, almost all have stayed, possibly indicating stronger loyalty or seniority.
💡 Takeaway:¶
The 4-year tenure mark appears to be a potential tipping point for employee dissatisfaction and departure. HR may consider introducing retention initiatives, promotions, or recognition programs targeted at employees approaching their fourth year.
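The 4-year tipping point described above can be quantified as a per-tenure attrition proportion via a normalized crosstab. A sketch on toy data (the toy frame is illustrative, not the real data):

```python
import pandas as pd

# Toy frame (illustration only); the notebook's df1 has the same columns
toy = pd.DataFrame({
    'tenure': [2, 3, 4, 4, 5, 6],
    'left':   [0, 0, 1, 1, 0, 0],
})
# normalize='index' converts each tenure row into stay/leave proportions
tenure_rates = pd.crosstab(toy['tenure'], toy['left'], normalize='index')
# On the real data: pd.crosstab(df1['tenure'], df1['left'], normalize='index')
```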
As the next step in analyzing the data, I'll calculate the mean and median satisfaction scores of employees who left and those who didn't.
df1.groupby(['left'])['satisfaction_level'].agg(['mean', 'median'])
| left | mean | median |
|---|---|---|
| 0 | 0.667365 | 0.69 |
| 1 | 0.440271 | 0.41 |
💡 Insight¶
Mean and Median Satisfaction Scores by Attrition Status
- Employees who left the company had a lower average satisfaction level (0.44) compared to those who stayed (0.67).
- The median satisfaction scores show a similar pattern: 0.41 for leavers vs. 0.69 for stayers.
- Interestingly, for employees who stayed, the mean is slightly lower than the median, suggesting the satisfaction distribution is left-skewed: a small group of less satisfied employees might be lowering the average.
- This reinforces earlier insights: low satisfaction is strongly associated with employee attrition at Salifort Motors.
Next, examine salary levels for different tenures.
fig, ax = plt.subplots(1, 2, figsize = (20, 6))
# Short-tenured employee
tenure_short = df1[df1['tenure'] < 7]
# Long-tenured employee
tenure_long = df1[df1['tenure'] > 6]
sns.histplot(data = tenure_short, x = 'tenure', hue = 'salary', discrete = 1,
hue_order = ['low', 'medium', 'high'], multiple = 'dodge', shrink = .5, ax = ax[0])
ax[0].set_title('Salary histogram by tenure: short-tenured people', fontsize = '14')
sns.histplot(data = tenure_long, x = 'tenure', hue = 'salary', discrete = 1,
hue_order = ['low', 'medium', 'high'], multiple = 'dodge', shrink = .4, ax = ax[1])
ax[1].set_title('Salary histogram by tenure: long-tenured people', fontsize = '14')
plt.show()
💡 Insight¶
Salary Distribution by Tenure
- Among short-tenured employees (tenure < 7 years), the majority earn low or medium salaries, with very few in the high salary bracket.
- Among long-tenured employees (tenure ≥ 7 years), we observe a similar pattern: most employees still fall within the low and medium salary ranges.
- There is no strong evidence that employees with longer tenure are more likely to receive higher salaries.
- This suggests that longevity at Salifort Motors doesn’t necessarily translate into higher compensation, which may affect employee motivation or retention among experienced staff.
Next, explore whether there's a correlation between working long hours and receiving high evaluation scores by creating a scatterplot of average_monthly_hours versus last_evaluation.
plt.figure(figsize = (14, 7))
sns.scatterplot(data = df1, x = 'average_monthly_hours', y = 'last_evaluation', hue = 'left', alpha = 0.4)
plt.axvline(x = 166.67, color = '#ff6361', label = '166.67 hrs./mo.', ls = '--')
plt.legend(labels = ['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by last evaluation score', fontsize='14');
📈 Insight: Average Monthly Hours vs. Last Evaluation Score¶
The scatterplot reveals two notable clusters among employees who left:
- Overworked High Performers: A dense concentration of leavers had high evaluation scores and worked significantly more than 167 hours per month, suggesting possible burnout or lack of recognition despite strong performance.
- Underperforming/Underutilized Employees: Another group of employees who left worked slightly below the 166.67-hour threshold and received relatively low evaluations, potentially reflecting disengagement or lack of fit.
Additional observations:
- There appears to be a positive correlation between average monthly hours and evaluation score, though it is not perfectly linear.
- Very few employees reside in the upper-left quadrant (high evaluation, low hours), implying that working fewer hours rarely coincides with high performance ratings.
- The majority of employees, regardless of whether they stayed or left, tend to work well above the baseline 166.67 hours/month, possibly indicating a company-wide culture of long hours.
These patterns may point to two root causes of attrition: burnout among top performers and low engagement among underperformers.
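The positive correlation suggested visually can be checked numerically with a single Pearson coefficient. A sketch on toy values (the numbers below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy columns (illustration only)
toy = pd.DataFrame({
    'average_monthly_hours': [150, 200, 250, 300],
    'last_evaluation':       [0.5, 0.6, 0.8, 0.9],
})
# Pearson correlation between hours and evaluation score
r = toy['average_monthly_hours'].corr(toy['last_evaluation'])
# On the real data: df1['average_monthly_hours'].corr(df1['last_evaluation'])
```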
Next, examine whether employees who worked long hours were promoted in the last 5 years.
plt.figure(figsize = (12, 3))
sns.scatterplot(data = df1, x = 'average_monthly_hours', y = 'promotion_last_5years', hue = 'left', alpha = 0.4)
plt.axvline(x = 166.67, color = '#ff6361', ls = '--')
plt.legend(labels = ['166.67 hrs./mo.', 'left', 'stayed'])
plt.title('Monthly hours by promotion last five years', fontsize = '14');
💡 Insight¶
Monthly Hours vs. Promotion in the Last 5 Years
The scatterplot above reveals important patterns regarding promotion and workload:
- Very few employees who received promotions in the last five years ended up leaving the company.
- Among those who worked the longest hours, very few were promoted, suggesting a possible disconnect between effort and recognition.
- A significant number of employees who left the company were clustered on the far right of the plot (i.e., high average monthly hours), indicating potential burnout or dissatisfaction from lack of reward despite long work hours.
These patterns may suggest that high-performing employees who consistently work long hours might not feel adequately recognized, potentially contributing to turnover at Salifort Motors.
Next, to inspect how the employees who left are distributed across departments.
df1['department'].value_counts()
department
sales          3239
technical      2244
support        1821
IT              976
RandD           694
product_mng     686
marketing       673
accounting      621
hr              601
management      436
Name: count, dtype: int64
plt.figure(figsize = (10, 5))
sns.histplot(data = df1, x = 'department', hue = 'left', discrete = 1,
hue_order = [0, 1], multiple = 'dodge', shrink = .5)
plt.xticks(rotation = 45)
plt.title('Counts of Stayed/Left by Department', fontsize = 12);
💡 Insight¶
Attrition Patterns Across Departments
- The chart shows that while absolute counts of employees vary by department (with sales, technical, and support having the highest), the proportion of employees who left appears relatively consistent across departments.
- No single department shows an exceptionally high or low attrition rate relative to its size.
- This suggests that department alone may not be a strong driver of employee turnover at Salifort Motors and that other factors such as workload, satisfaction, or promotion opportunities might play a more important role.
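Because raw counts can mask proportions, the "relatively consistent across departments" claim is easiest to verify with per-department attrition rates. A sketch on toy data (illustrative values only):

```python
import pandas as pd

# Toy frame (illustration only)
toy = pd.DataFrame({
    'department': ['sales', 'sales', 'IT', 'IT', 'hr'],
    'left':       [1, 0, 0, 0, 1],
})
# Attrition rate per department, highest first
dept_rates = toy.groupby('department')['left'].mean().sort_values(ascending=False)
# On the real data: df1.groupby('department')['left'].mean().sort_values(ascending=False)
```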
Lastly, inspecting strong correlations between variables in the data.
# Filter only numeric columns for correlation
numeric_df = df0.select_dtypes(include=['float64', 'int64'])
# Create the heatmap with only numeric columns
plt.figure(figsize=(10, 5))
heatmap = sns.heatmap(numeric_df.corr(),
vmin=-1,
vmax=1,
annot=True,
cmap=sns.color_palette('vlag', as_cmap=True))
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize': 12}, pad=12)
plt.show()
🔍 Correlation Analysis¶
The heatmap above shows the pairwise correlations between numeric variables in the dataset.
Key Observations:
- Satisfaction level has the strongest negative correlation with attrition (left), at -0.39, indicating that lower satisfaction is a strong predictor of employee departure.
- Evaluation score, number of projects, and monthly working hours are all positively correlated with each other. This suggests that high performers often take on more projects and work longer hours, possibly increasing their risk of burnout.
- The correlation between promotion in the last 5 years and other variables is generally weak. This might indicate that promotions are either rare or not directly tied to workload or performance metrics.
- Surprisingly, tenure has only a modest positive correlation with attrition. This aligns with earlier findings showing that very short or very long tenure durations influence turnover differently.
💡 Insight¶
- Employees who work long hours, handle many projects, and receive limited recognition (via promotions or evaluation) are more likely to leave.
- This pattern may reflect organizational strain or poor management practices, where high-performing individuals are underappreciated or overburdened.
- The heatmap complements earlier EDA findings, especially on burnout risk and the importance of employee satisfaction.
- Interestingly, employees with long tenures (over 6 years) tend to stay, which may indicate either job stability among senior staff or selective retention practices.
paCe: Construct Stage¶
- Build models to predict employee attrition.
- Choose appropriate modeling techniques.
- Confirm model assumptions.
- Evaluate results and compare model performance.
Logistic Regression Assumptions¶
- ✅ Outcome is binary
- ✅ Observations are independent
- 🔄 No severe multicollinearity
- ⚠️ No extreme outliers (handled in EDA)
- 🔄 Linear relationship between features and log-odds
- ✅ Sufficiently large dataset (11,991 rows)
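The multicollinearity check marked 🔄 above can be made concrete with variance inflation factors (VIF). statsmodels ships a VIF function, but a minimal numpy version is easy to sketch; the `vif` helper and the synthetic data below are illustrative assumptions (in the notebook it would be applied to the numeric columns of X):

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    column j on all the other columns (plus an intercept)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(1)
a, b = rng.normal(size=100), rng.normal(size=100)
X = np.column_stack([a, b, a + 0.01 * rng.normal(size=100)])  # col 2 nearly duplicates col 0
vifs = vif(X)  # a VIF above roughly 5-10 signals problematic collinearity
```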
Modeling Approach A: Logistic Regression¶
Preprocessing Steps:
- department: categorical to one-hot encoding
- salary: ordinal to numerical (low=0, medium=1, high=2)

Next cell: start with LabelEncoder or map() for salary and pd.get_dummies() for department.
Step 3. Model Building, Step 4. Results and Evaluation¶
- Fit a model that predicts the outcome variable using two or more independent variables
- Check model assumptions
- Evaluate the model
# Copy the dataframe
df_enc = df1.copy()
# Encode the `salary` column as an ordinal numeric category
df_enc['salary'] = (
df_enc['salary'].astype('category')
.cat.set_categories(['low', 'medium', 'high'])
.cat.codes
)
# Dummy encode the `department` column
df_enc = pd.get_dummies(df_enc, drop_first=False)
df_enc.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
🔍 Correlation Heatmap (Subset of Features)¶
This heatmap shows pairwise correlations between selected numerical features in the dataset. It's useful for identifying multicollinearity and spotting interesting relationships before model building.
plt.figure(figsize = (10, 4))
sns.heatmap(df_enc[['satisfaction_level', 'last_evaluation', 'number_project', 'average_monthly_hours', 'tenure']]
.corr(), annot = True, cmap = 'crest')
plt.title('Correlation Heatmap of Selected Features', fontsize=12, pad=12)
plt.show()
💡 Insight¶
- Most features in this subset have weak to moderate correlation with each other.
- No correlation is strong enough (above ±0.8) to indicate multicollinearity concerns for regression.
- Notably, number_project, average_monthly_hours, and last_evaluation are moderately correlated; these relationships may influence model behavior.
📊 Comparison of Employee Turnover Across Departments¶
# In the legend, 0 (purple) represents employees who stayed; 1 (red) represents employees who left
pd.crosstab(df1['department'], df1['left']).plot(kind = 'bar', color = ['m', 'r'])
plt.title('Counts of Employees Who Left vs. Stayed Across Departments')
plt.ylabel('Employee Count')
plt.xlabel('Department')
plt.show()
💡 Insight¶
- While departments like sales, support, and technical had the highest overall headcount, they also showed notable turnover.
- In contrast, departments like management, accounting, and RandD had fewer employees leave.
- However, no department stands out with unusually high attrition proportionally, suggesting that turnover is a company-wide issue rather than isolated to specific teams.
Modeling Approach A: Logistic Regression¶
This section implements the logistic regression model for predicting employee attrition. The process includes data cleaning, outlier removal, target/feature selection, model training, and performance evaluation.
✂️ Remove Outliers from tenure Column¶
Since logistic regression is sensitive to outliers, we remove the extreme values in the tenure column that were previously identified.
df_logistic_reg = df_enc[(df_enc['tenure'] >= lower_limit) & (df_enc['tenure'] <= upper_limit)]
df_logistic_reg.head()
| | satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
🎯 Define Target and Features¶
- y: The target variable — whether the employee left the company.
- X: All remaining features used for prediction, excluding left.
y = df_logistic_reg['left']
y.head()
0    1
2    1
3    1
4    1
5    1
Name: left, dtype: int64
X = df_logistic_reg.drop('left', axis = 1)
X.head()
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 5 | 0.41 | 0.50 | 2 | 153 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
🧪 Split the Dataset for Training and Testing¶
We split the dataset into training and test sets using a 75/25 ratio and apply stratification to preserve class balance in both sets.
# Split the data into training set and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 42)
🔨 Construct and Fit the Logistic Regression Model¶
We now create a logistic regression model and train it on the training data.
# Construct a logistic regression model and fit it to the training dataset
log_clf = LogisticRegression(random_state = 42, max_iter = 1000).fit(X_train, y_train)
Test the logistic regression model:¶
Using the model to make predictions on the test set.
y_pred = log_clf.predict(X_test)
Evaluating Model Predictions: Confusion Matrix¶
# Computing values for the confusion matrix
log_cm = confusion_matrix(y_test, y_pred, labels = log_clf.classes_)
# Creating the display of confusion matrix
log_disp = ConfusionMatrixDisplay(confusion_matrix = log_cm,
display_labels = log_clf.classes_)
# Now plot the confusion matrix
log_disp.plot(values_format = '')
plt.show()
True negatives: (2165) — The number of people who did not leave and were correctly predicted as such by the model.
False positives: (156) — The number of people who did not leave but were incorrectly predicted to leave.
False negatives: (348) — The number of people who left but the model failed to predict their departure.
True positives: (123) — The number of people who left and were correctly predicted to leave.
📌 Note: A perfect model would predict all employees correctly, with zero false positives or false negatives.
🔍 Evaluate Model Performance¶
Next, I’ll generate a classification report including precision, recall, F1-score, and accuracy to assess how well the logistic regression model performs.
Before interpreting the results, it’s important to check the class distribution in the target variable (left) to understand whether any imbalance may influence these metrics.
df_logistic_reg['left'].value_counts(normalize = True)
left
0    0.831468
1    0.168532
Name: proportion, dtype: float64
There is an approximately 83%-17% split.
So the data is not perfectly balanced, but it is not severely imbalanced either. If it were more severely imbalanced, I might want to resample the data to make it more balanced. In this case, I can use the data without modifying the class balance and continue evaluating the model.
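For reference, scikit-learn also offers a lighter-weight alternative to resampling: the `class_weight='balanced'` option reweights the loss inversely to class frequencies. A small sketch on synthetic data with a similar ~83/17 split (not a step from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data mimicking the ~83/17 split in `left`
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.83, 0.17],
                                     random_state=42)

# class_weight='balanced' upweights the minority class during fitting
clf = LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42)
clf.fit(X_demo, y_demo)
print(clf.score(X_demo, y_demo))
```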
🧮 Evaluate Model with Classification Report¶
Let’s compute precision, recall, F1-score, and overall accuracy to assess how well the model performs on the test set.
target_names = ['Stayed (0)', 'Left (1)']
print(classification_report(y_test, y_pred, target_names = target_names))
              precision    recall  f1-score   support

  Stayed (0)       0.86      0.93      0.90      2321
    Left (1)       0.44      0.26      0.33       471

    accuracy                           0.82      2792
   macro avg       0.65      0.60      0.61      2792
weighted avg       0.79      0.82      0.80      2792
💡 Insight¶
Model Performance Evaluation
- The model performs well on predicting employees who did not leave (precision: 0.86, recall: 0.93).
- However, it struggles with predicting those who left (precision: 0.44, recall: 0.26), which is common in imbalanced classification tasks.
This suggests the model is conservative and favors predicting retention over attrition, potentially minimizing false alarms but missing many true leavers.
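One common way to trade precision for recall in this situation (sketched here on synthetic data, not applied in this notebook) is to lower the default 0.5 decision threshold on the predicted probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X_demo, y_demo = make_classification(n_samples=1000, weights=[0.83, 0.17],
                                     random_state=42)
clf = LogisticRegression(max_iter=1000, random_state=42).fit(X_demo, y_demo)

proba = clf.predict_proba(X_demo)[:, 1]          # P(class = 1), i.e. "would leave"
default_preds = (proba >= 0.5).astype(int)       # standard threshold
lenient_preds = (proba >= 0.3).astype(int)       # lowered threshold

# Lowering the threshold flags more potential leavers, so recall cannot decrease
print(recall_score(y_demo, default_preds), recall_score(y_demo, lenient_preds))
```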
Modeling Approach B: Tree-based Model¶
This approach covers implementation of Decision Tree and Random Forest.
🎯 Define Target and Features¶
- y: The target variable — whether the employee left the company.
- X: All remaining features used for prediction, excluding left.
y = df_enc['left']
y.head()
0    1
1    1
2    1
3    1
4    1
Name: left, dtype: int64
# Selecting the features
X = df_enc.drop('left', axis = 1)
X.head()
| satisfaction_level | last_evaluation | number_project | average_monthly_hours | tenure | work_accident | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
📦 Split Data into Training and Testing Sets¶
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 0)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train distribution:\n{y_train.value_counts(normalize=True)}")
X_train shape: (8993, 18)
X_test shape: (2998, 18)
y_train distribution:
left
0    0.833982
1    0.166018
Name: proportion, dtype: float64
🌳 Decision Tree – Round 1¶
Construct a decision tree model and set up a cross-validated grid search to exhaustively search for the best model parameters.
tree = DecisionTreeClassifier(random_state=0)
cv_params = {
'max_depth': [4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
tree1 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
Fit the Decision Tree Model to the Training Data¶
%%time
tree1.fit(X_train, y_train)
CPU times: total: 5.56 s Wall time: 5.59 s
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
Best estimator: DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=0)
✅ The decision tree model was successfully trained using grid search with 4-fold cross-validation.
The best parameters selected were max_depth=4 and min_samples_leaf=5.
These constraints help prevent overfitting and improve generalization to unseen data.
🌳 Best Parameters and AUC Score for Decision Tree (Round 1)¶
# checking the best parameters
tree1.best_params_
{'max_depth': 4, 'min_samples_leaf': 5, 'min_samples_split': 2}
tree1.best_score_
0.969819392792457
💡 Insight¶
This is a strong AUC score (~0.97), indicating that the decision tree model is highly effective at classifying employees based on the training data.
The selected parameters also suggest a tree that balances depth and generalization, helping to avoid overfitting.
📊 Extract and Display Cross-Validation Metrics¶
def make_results(model_name: str, model_object, metric: str):
"""
Returns a summary table of cross-validated performance metrics for a trained model.
Parameters:
model_name (str): Custom label for the model (e.g., 'Decision Tree')
model_object (GridSearchCV): The trained model object from GridSearchCV
metric (str): The main evaluation metric to identify the best model ('auc', 'precision', etc.)
Returns:
table (DataFrame): A one-row DataFrame containing precision, recall, F1, accuracy, and AUC
"""
# Map the metric name to its corresponding column in cv_results_
metric_dict = {
'auc': 'mean_test_roc_auc',
'precision': 'mean_test_precision',
'recall': 'mean_test_recall',
'f1': 'mean_test_f1',
'accuracy': 'mean_test_accuracy'
}
# Extract all results into a DataFrame
cv_results = pd.DataFrame(model_object.cv_results_)
# Get the row with the best score for the selected metric
best_row = cv_results.iloc[cv_results[metric_dict[metric]].idxmax(), :]
# Collect performance metrics
table = pd.DataFrame({
'model': [model_name],
'precision': [best_row.mean_test_precision],
'recall': [best_row.mean_test_recall],
'F1': [best_row.mean_test_f1],
'accuracy': [best_row.mean_test_accuracy],
'auc': [best_row.mean_test_roc_auc]
})
return table
# Generate CV results table for the decision tree model
tree1_cv_results = make_results('Decision Tree CV', tree1, 'auc')
tree1_cv_results
| model | precision | recall | F1 | accuracy | auc | |
|---|---|---|---|---|---|---|
| 0 | Decision Tree CV | 0.914552 | 0.916949 | 0.915707 | 0.971978 | 0.969819 |
💡 Insight¶
All of these scores from the decision tree model are strong indicators of good model performance.
The model demonstrates high accuracy and AUC, confirming it performs well across multiple validation folds.
However, it's important to note that decision trees can be prone to overfitting.
To address this and improve generalization, the next step is to construct a random forest model, which aggregates predictions from multiple trees to reduce variance and boost performance.
🌲 Random Forest – Round 1¶
- I now construct a Random Forest model and apply grid search with cross-validation to identify the best set of hyperparameters.
Note: Random Forests are ensemble models that reduce overfitting by combining predictions from multiple decision trees.
🛠️ Model Setup and Parameter Grid¶
# Instantiate model
rf = RandomForestClassifier(random_state=0)
# Define hyperparameter grid
cv_params = {
'max_depth': [3, 5, None],
'max_features': ['sqrt', None],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3, 4],
'n_estimators': [300, 500]
}
# Define scoring metrics
scoring = ['accuracy', 'precision', 'f1', 'recall', 'roc_auc']
🚀 Fit the Random Forest Model to the Training Data¶
%%time
# Grid search with cross-validation (4-fold)
rf1 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc')
rf1.fit(X_train, y_train)
CPU times: total: 39min 48s Wall time: 40min 11s
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0),
             param_grid={'max_depth': [3, 5, None],
                         'max_features': ['sqrt', None],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'f1', 'recall', 'roc_auc'])
Best estimator: RandomForestClassifier(max_depth=5, max_features=None, max_samples=0.7,
                                       min_samples_split=4, n_estimators=500, random_state=0)
💡 Insight¶
The random forest model completed a comprehensive grid search and selected hyperparameters that balance tree depth, sample size, and the number of estimators.
With max_depth=5, max_samples=0.7, and n_estimators=500, the model is optimized to reduce overfitting and improve generalization.
This setup is expected to perform more robustly than a single decision tree, especially when evaluated on unseen data.
💾 Specify Save Location for the Pickled Model¶
# Define the directory path where the model will be saved
path = "C:\\Users\\wodoo\\Desktop\\model\\"  # trailing separator so `path + filename` joins correctly
📦 Save model as a pickle file¶
Define functions to pickle the model and read in the model.
def write_pickle(path, model_object, save_as: str):
"""
Save a trained model as a .pickle file in the specified path.
Parameters:
path (str): Folder location where the model will be saved.
model_object: Trained model object to be pickled.
save_as (str): Filename (without extension) to save the model as.
"""
with open(path + save_as + '.pickle', 'wb') as to_write:
pickle.dump(model_object, to_write)
📂 Load model from a pickle file¶
def read_pickle(path, saved_model_name: str):
"""
Load a pickled model from the specified path.
Parameters:
path (str): Folder location where the model is stored.
saved_model_name (str): Filename (without extension) of the saved model.
Returns:
model: The loaded model object.
"""
with open(path + saved_model_name + '.pickle', 'rb') as to_read:
model = pickle.load(to_read)
return model
📦 Save and Load the Random Forest Model Using Pickle¶
# Write pickle
write_pickle(path, rf1, 'hr_rf1')
# Read pickle
rf1 = read_pickle(path, 'hr_rf1')
🌲 Best Parameters and AUC Score for Random Forest (Round 1)¶
# Check best AUC score on CV
rf1.best_score_
0.9804250949807172
# Check best params
rf1.best_params_
{'max_depth': 5,
'max_features': None,
'max_samples': 0.7,
'min_samples_leaf': 1,
'min_samples_split': 4,
'n_estimators': 500}
💡 Insight¶
The random forest model achieved a very strong AUC score of ~0.98 during cross-validation, outperforming the decision tree from Round 1.
The selected hyperparameters describe an ensemble of 500 moderately deep trees (max_depth=5) each trained on 70% subsamples of the data, allowing the model to capture complex patterns while still generalizing well.
📊 Compare Evaluation Scores: Decision Tree vs Random Forest (Round 1)¶
We now compare the cross-validation results of both models to determine which performs better on the training data.
# Get all CV scores
rf1_cv_results = make_results('random forest cv', rf1, 'auc')
print(tree1_cv_results)
print(rf1_cv_results)
model precision recall F1 accuracy auc
0 Decision Tree CV 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy auc
0 random forest cv 0.950023 0.915614 0.932467 0.977983 0.980425
💡 Insight¶
The random forest model outperforms the decision tree across all evaluation metrics except for recall, where the difference is minimal (~0.001).
This suggests that the random forest provides better overall predictive performance while maintaining nearly identical sensitivity.
As expected, the ensemble approach offers improved robustness and accuracy over a single decision tree.
🧪 Final Model Evaluation on Test Set¶
To assess how well the model generalizes to new data, I evaluate its performance on the unseen test set.
Define a function that calculates key performance metrics based on the model’s predictions.
def get_scores(model_name: str, model, X_test_data, y_test_data):
"""
Generate a table of performance metrics for a fitted model on the test set.
Parameters:
model_name (str): Name of the model for the output table.
model (GridSearchCV): Trained GridSearchCV object with .best_estimator_.
X_test_data (array): Feature data from the test set.
y_test_data (array): Target values from the test set.
Returns:
DataFrame: Precision, recall, F1, accuracy, and AUC scores.
"""
preds = model.best_estimator_.predict(X_test_data)
auc = roc_auc_score(y_test_data, preds)
accuracy = accuracy_score(y_test_data, preds)
precision = precision_score(y_test_data, preds)
recall = recall_score(y_test_data, preds)
f1 = f1_score(y_test_data, preds)
return pd.DataFrame({
'model': [model_name],
'precision': [precision],
'recall': [recall],
'f1': [f1],
'accuracy': [accuracy],
'AUC': [auc]
})
✅ Predict on Test Set Using Final Model¶
# Get predictions on test data
rf1_test_scores = get_scores('random forest1 test', rf1, X_test, y_test)
rf1_test_scores
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | random forest1 test | 0.964211 | 0.919679 | 0.941418 | 0.980987 | 0.956439 |
💡 Insight¶
The model achieved strong performance on the test set, closely matching the validation results.
This consistency suggests that the random forest model generalizes well to unseen data and is not overfitting.
Since the test set was reserved exclusively for final evaluation, we can be confident that these metrics reflect real-world performance.
🔧 Feature Engineering¶
Although the evaluation scores from Round 1 were very strong, they might be overly optimistic due to potential data leakage.
Data leakage occurs when the model has access to information during training that would not realistically be available in a real-world deployment scenario. This can lead to unrealistically high performance that fails to generalize.
In this case, two features stand out:
- satisfaction_level: It’s unlikely the company would have reliable satisfaction data for all employees, especially in advance of their departure.
- average_monthly_hours: This metric may also reflect hindsight behavior — such as reduced hours after someone has already disengaged or been flagged for termination.
✅ Plan for Round 2¶
- I will drop satisfaction_level to avoid relying on a potentially leaked or unavailable feature.
- I will create a new binary feature called overworked, which captures whether an employee works significantly more than the expected average hours (>175 hours/month, as derived below).
This helps represent workload pressure without using raw hour counts that could be misleading.
These changes are intended to make the model more realistic and applicable in a real HR setting, where clean, predictive signals matter most.
🗂️ Drop satisfaction_level to Prevent Potential Data Leakage¶
# Drop `satisfaction_level` and save the result in a new DataFrame
df2 = df_enc.drop('satisfaction_level', axis=1)
df2.head()
| last_evaluation | number_project | average_monthly_hours | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 1 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 2 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False |
| 3 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
| 4 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False |
🔍 Inspect Workload: Prepare for Creating the overworked Feature¶
# Temporarily copy `average_monthly_hours` into new column `overworked`
df2['overworked'] = df2['average_monthly_hours']
# Check distribution range of working hours
print('Max hours:', df2['overworked'].max())
print('Min hours:', df2['overworked'].min())
Max hours: 310
Min hours: 96
💡 Insight¶
An average full-time employee working 8 hours/day, 5 days/week, for 50 weeks a year would clock around 166.67 hours per month.
To flag unusually high workloads, I'll define an employee as overworked if they average more than 175 hours per month.
This threshold will allow me to create a new binary feature:
- 1 = overworked
- 0 = not overworked
The transformation will be done using:
df2['overworked'] = (df2['overworked'] > 175).astype(int)
🔧 Create Binary overworked Feature Based on Monthly Hours¶
# Define `overworked` as working more than 175 hours/month
df2['overworked'] = (df2['overworked'] > 175).astype(int)
df2['overworked'].head()
0    0
1    1
2    1
3    1
4    0
Name: overworked, dtype: int32
🧹 Drop average_monthly_hours (Replaced by overworked)¶
# Drop the now-redundant average_monthly_hours column
df2 = df2.drop('average_monthly_hours', axis=1)
df2.head()
| last_evaluation | number_project | tenure | work_accident | left | promotion_last_5years | salary | department_IT | department_RandD | department_accounting | department_hr | department_management | department_marketing | department_product_mng | department_sales | department_support | department_technical | overworked | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
| 1 | 0.86 | 5 | 6 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 2 | 0.88 | 7 | 4 | 0 | 1 | 0 | 1 | False | False | False | False | False | False | False | True | False | False | 1 |
| 3 | 0.87 | 5 | 5 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 1 |
| 4 | 0.52 | 2 | 3 | 0 | 1 | 0 | 0 | False | False | False | False | False | False | False | True | False | False | 0 |
🎯 Define Target and Features (Post Feature Engineering)¶
# Isolate the target variable
y = df2['left']
# Select the remaining features
X = df2.drop('left', axis=1)
📦 Split Data into Training and Testing Sets¶
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y, random_state = 0)
🌳 Decision Tree – Round 2 (with Feature Engineering)¶
# Instantiate a decision tree classifier
tree = DecisionTreeClassifier(random_state=0)
# Define hyperparameter grid
cv_params = {
'max_depth': [4, 6, 8, None],
'min_samples_leaf': [2, 5, 1],
'min_samples_split': [2, 4, 6]
}
# Define scoring metrics
scoring = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
🔄 Fit the Decision Tree Model (Round 2)¶
%%time
tree2 = GridSearchCV(tree, cv_params, scoring=scoring, cv=4, refit='roc_auc')
tree2.fit(X_train, y_train)
CPU times: total: 3.73 s Wall time: 3.78 s
GridSearchCV(cv=4, estimator=DecisionTreeClassifier(random_state=0),
             param_grid={'max_depth': [4, 6, 8, None],
                         'min_samples_leaf': [2, 5, 1],
                         'min_samples_split': [2, 4, 6]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'recall', 'f1', 'roc_auc'])
Best estimator: DecisionTreeClassifier(max_depth=6, min_samples_leaf=2, min_samples_split=6,
                                       random_state=0)
🌳 Best Parameters and AUC Score for Decision Tree (Round 2)¶
# Check best params
tree2.best_params_
{'max_depth': 6, 'min_samples_leaf': 2, 'min_samples_split': 6}
# Check best AUC score on CV
tree2.best_score_
0.9586752505340426
💡 Insight¶
This model performs very well, even after removing features like satisfaction_level and average_monthly_hours.
The strong AUC (~0.96) shows that the model still captures important patterns with fewer, more realistic features.
📊 Compare Evaluation Scores: Decision Tree Round 1 vs Round 2¶
# Get all CV scores
tree2_cv_results = make_results('decision tree2 cv', tree2, 'auc')
print(tree1_cv_results)
print(tree2_cv_results)
model precision recall F1 accuracy auc
0 Decision Tree CV 0.914552 0.916949 0.915707 0.971978 0.969819
model precision recall F1 accuracy auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
💡 Insight¶
Some evaluation metrics dropped slightly in Round 2, which is expected due to the reduced feature set.
However, the model still performs strongly overall, showing that the engineered features retained valuable predictive power while reducing the risk of data leakage.
🌲 Random Forest – Round 2 (with Feature Engineering)¶
# Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=0)
# Define hyperparameter grid
cv_params = {
'max_depth': [3, 5, None],
'max_features': ['sqrt', None],
'max_samples': [0.7, 1.0],
'min_samples_leaf': [1, 2, 3],
'min_samples_split': [2, 3, 4],
'n_estimators': [300, 500]
}
# Define scoring metrics
scoring = ['accuracy', 'precision', 'f1', 'recall', 'roc_auc']
# Set up GridSearchCV
rf2 = GridSearchCV(rf, cv_params, scoring=scoring, cv=4, refit='roc_auc')
🔄 Fit the Random Forest Model (Round 2)¶
%%time
rf2.fit(X_train, y_train)
CPU times: total: 34min 51s Wall time: 35min 13s
GridSearchCV(cv=4, estimator=RandomForestClassifier(random_state=0),
             param_grid={'max_depth': [3, 5, None],
                         'max_features': ['sqrt', None],
                         'max_samples': [0.7, 1.0],
                         'min_samples_leaf': [1, 2, 3],
                         'min_samples_split': [2, 3, 4],
                         'n_estimators': [300, 500]},
             refit='roc_auc',
             scoring=['accuracy', 'precision', 'f1', 'recall', 'roc_auc'])
Best estimator: RandomForestClassifier(max_samples=0.7, min_samples_leaf=3, n_estimators=300,
                                       random_state=0)
💾 Save and Reload Random Forest Round 2 with Pickle¶
# Write pickle
write_pickle(path, rf2, 'hr_rf2')
# Read in pickle
rf2 = read_pickle(path, 'hr_rf2')
🌲 Best Parameters and AUC Score for Random Forest (Round 2)¶
# Check best parameters
rf2.best_params_
{'max_depth': None,
'max_features': 'sqrt',
'max_samples': 0.7,
'min_samples_leaf': 3,
'min_samples_split': 2,
'n_estimators': 300}
# Check best AUC score from cross-validation
rf2.best_score_
0.96873476079196
📊 Compare Evaluation Scores: Decision Tree vs Random Forest (Round 2)¶
# Get evaluation results for Random Forest Round 2
rf2_cv_results = make_results('random forest2 cv', rf2, 'auc')
# Display comparison between decision tree and random forest (Round 2)
print(tree2_cv_results)
print(rf2_cv_results)
model precision recall F1 accuracy auc
0 decision tree2 cv 0.856693 0.903553 0.878882 0.958523 0.958675
model precision recall F1 accuracy auc
0 random forest2 cv 0.911854 0.866715 0.888685 0.963972 0.968735
💡 Insight¶
The scores dropped slightly in this round due to feature engineering, but the random forest still outperforms the decision tree in nearly every metric.
If we focus on AUC as the key indicator, the random forest remains the better choice, maintaining strong generalization while reducing the risk of data leakage.
🧪 Final Evaluation on Test Set (Random Forest Round 2)¶
# Get predictions on test data
rf2_test_scores = get_scores('random forest2 test', rf2, X_test, y_test)
rf2_test_scores
| model | precision | recall | f1 | accuracy | AUC | |
|---|---|---|---|---|---|---|
| 0 | random forest2 test | 0.904564 | 0.875502 | 0.889796 | 0.963976 | 0.928551 |
💡 Insight¶
This seems to be a stable, well-performing final model.
Its performance on the test set closely mirrors the cross-validation results, suggesting it generalizes well to unseen data.
📉 Confusion Matrix for Random Forest (Round 2)¶
# Generate array of values for confusion matrix
preds = rf2.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, preds, labels=rf2.classes_)
# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=rf2.classes_)
disp.plot(values_format='');
💡 Insight¶
The model produces 46 false positives and 62 false negatives, meaning it misses more actual leavers than it incorrectly flags as likely to leave.
Despite that, the True Positives (436) and True Negatives (2454) reflect that this is still a strong model with reliable predictive power.
🌲 Decision Tree Splits (Round 2)¶
# Plot the tree
plt.figure(figsize=(85,20))
plot_tree(tree2.best_estimator_, max_depth=6, fontsize=14, feature_names=list(X.columns),
          class_names=['stayed', 'left'], filled=True);
plt.show()
📝 Note: You can double-click the tree output in Jupyter to zoom in and inspect the individual splits.
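Besides zooming into the rendered figure, the same splits can be read as plain text with scikit-learn's `export_text`. A small illustration on a freshly fitted toy tree, since `tree2` only exists in the notebook session; in the notebook, `tree2.best_estimator_` could be passed instead:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
toy_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# export_text prints each split rule, indented to match tree depth
rules = export_text(toy_tree, feature_names=list(iris.feature_names))
print(rules)
```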
🌿 Feature Importance (Decision Tree Round 2)¶
# Get feature importances
tree2_importances = pd.DataFrame(tree2.best_estimator_.feature_importances_,
columns = ['gini_importance'],
index = X.columns)
# Filter and sort
tree2_importances = tree2_importances.sort_values(by='gini_importance', ascending=False)
tree2_importances = tree2_importances[tree2_importances['gini_importance'] != 0]
# Visualize
tree2_importances.plot(kind='barh', figsize=(8,5), legend=False)
plt.title("Feature Importance (Decision Tree Round 2)")
plt.xlabel("Gini Importance")
plt.gca().invert_yaxis()
plt.show()
Barplot to visualize the decision tree feature importances.
sns.barplot(data=tree2_importances, x="gini_importance", y=tree2_importances.index, orient='h')
plt.title("Decision Tree: Feature Importances for Employee Leaving", fontsize=12)
plt.ylabel("Feature")
plt.xlabel("Importance")
plt.show()
💡 Insight¶
The bar plot above shows that in this decision tree model, last_evaluation, number_project, tenure, and overworked have the highest importance in that order.
These variables contributed the most to predicting whether an employee would leave the company, according to Gini importance.
🌲 Random Forest Feature Importance (Round 2)¶
Now, plot the feature importances for the random forest model.
# Get feature importances
feat_impt = rf2.best_estimator_.feature_importances_
# Get indices of top 10 features
ind = np.argpartition(rf2.best_estimator_.feature_importances_, -10)[-10:]
# Get column labels of top 10 features
feat = X.columns[ind]
# Filter `feat_impt` to consist of top 10 feature importances
feat_impt = feat_impt[ind]
y_df = pd.DataFrame({'Feature':feat,'Importance':feat_impt})
y_sort_df = y_df.sort_values('Importance')
fig = plt.figure()
ax1 = fig.add_subplot(111)
y_sort_df.plot(kind = 'barh', ax=ax1, x = 'Feature', y = 'Importance')
ax1.set_title('Random Forest: Feature Importances for Employee Leaving', fontsize = 12)
ax1.set_ylabel('Feature')
ax1.set_xlabel('Importance')
plt.show()
💡 Insight¶
The plot above shows that in this random forest model, number_project, tenure, last_evaluation, and overworked have the highest importance, in that order.
These features are most helpful in predicting whether an employee will leave, and they align closely with those identified by the decision tree model.
pacE: Execute Stage¶
- Interpret and communicate model performance
- Share actionable, data-driven recommendations with stakeholders
✏️ Evaluation Metrics Overview¶
- AUC: Measures how well the model distinguishes between classes.
- Precision: Proportion of positive predictions that are actually correct.
- Recall: Proportion of actual positives correctly identified.
- Accuracy: Overall proportion of correct predictions.
- F1-score: Harmonic mean of precision and recall; useful for imbalanced classes.
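All of these metrics are available as functions in sklearn.metrics. A minimal sketch on a toy set of labels (the notebook applies the same calls to y_test and the model's predictions):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy labels for illustration: 1 = left, 0 = stayed
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
y_prob = [0.1, 0.6, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]  # predicted P(left)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc_auc  :", roc_auc_score(y_true, y_prob))  # needs scores, not hard labels
```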
Step 4. Final Results & Business Impact¶
- Interpret model results with business context
- Use performance metrics to validate model quality
- Present key findings and recommendations to stakeholders
📊 Summary of Model Results¶
Logistic Regression
Achieved an accuracy of 83% on the test set, with weighted averages of:
- Precision: 80%
- Recall: 83%
- F1-score: 80%
Decision Tree (Round 2)
After feature engineering, the decision tree model achieved:
- AUC: 93.8%
- Accuracy: 96.2%
- Precision: 87.0%
- Recall: 90.4%
- F1-score: 88.7%
Random Forest (Round 2)
The Round 2 Random Forest model outperformed earlier models across most evaluation metrics and demonstrated strong generalization on unseen data, indicating robust predictive capability.
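Since the Round 2 Random Forest is the winning model, it is the natural candidate for the pickle step mentioned in the packages list. A minimal sketch, assuming rf2 is the fitted GridSearchCV from earlier (demonstrated here with a small stand-in forest so the cell is self-contained):

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for rf2.best_estimator_; in the notebook, pickle that object instead
X_demo, y_demo = make_classification(n_samples=100, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with open("rf_round2.pkl", "wb") as f:
    pickle.dump(model, f)

with open("rf_round2.pkl", "rb") as f:
    restored = pickle.load(f)

# The restored model predicts identically to the original
assert (restored.predict(X_demo) == model.predict(X_demo)).all()
```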
✅ Conclusion, Recommendations, and Next Steps¶
The models and feature importances indicate that employee attrition is strongly influenced by overwork, project load, tenure, and evaluation scores.
💼 Business Recommendations¶
- Set a cap on the number of simultaneous projects per employee.
- Investigate why employees with four-year tenures show high dissatisfaction.
- Offer clear rewards or compensation for extended working hours.
- Improve communication around overtime pay and workload expectations.
- Conduct team-wide and company-wide discussions to address work culture and burnout.
- Reconsider how performance is evaluated; high scores should not be tied only to extreme workloads.
✅ Next Steps¶
- Validate the last_evaluation feature to ensure it does not introduce data leakage. Confirm that its values reflect pre-attrition performance and do not encode post-decision signals.
- Experiment with removing potentially sensitive features such as last_evaluation or satisfaction_level, and evaluate how model performance is affected. This will help test the model's robustness and reduce reliance on features that may not be available in real-world scenarios.
- Explore alternative modeling goals, such as predicting performance or identifying early signs of disengagement, to expand the business value beyond attrition forecasting.
- Apply unsupervised learning techniques, like K-means clustering, to identify natural groupings of employees. These insights can support more targeted retention strategies and workforce planning.
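The clustering idea in the last step could look like the sketch below: scale the numeric features (K-means is distance-based), fit the model, and attach a segment label to each employee. The feature names and data here are illustrative stand-ins for the notebook's real columns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical employee features; the notebook would use its actual X columns
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "satisfaction_level": rng.uniform(0, 1, 300),
    "number_project": rng.integers(2, 7, 300),
    "tenure": rng.integers(2, 10, 300),
})

scaled = StandardScaler().fit_transform(df)           # put features on one scale
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)
df["cluster"] = km.labels_                            # segment label per employee

print(df.groupby("cluster").mean())  # profile each segment for HR discussion
```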
💼 Let's Connect!¶
Thank you for reading! If you’d like to learn more about my work or connect professionally, feel free to visit my GitHub Portfolio or reach out on LinkedIn.
Your feedback is always welcome — I'm constantly evolving as a data analyst!